I. INTRODUCTION

The process of indexing incoming documents (generally by assigning
terms, i.e., words or word phrases, to them to indicate their content)
is common to most retrieval systems. Classification, or assignment of
such terms or documents to classes, is likewise an important operation
in such systems. For example, a very rudimentary classification system
might consist of classes of the form "all documents indexed by the
term X".

There are various devices which supplement the basic process of
indexing and thereby increase retrieval effectiveness. These are
summarized by Cleverdon¹, who is in the process of testing their
comparative utility. Salton's SMART system is also designed to enable
testing of such devices. Doyle² has pointed out two somewhat
contrasting approaches to information retrieval, each one involving a
device or a group of devices for generating additional information to
aid in retrieval.

The first approach is to concentrate on the classification of
documents or terms, perhaps using a hierarchy (which might be
automatically generated) to exhibit various relations among them. The
other approach builds on the concept of coordinate indexing, which
permits requests for documents indexed by logical combinations of
terms. The extensions to coordinate indexing which form this second
approach are probabilistic indexing and statistical association
techniques.

Probabilistic indexing was proposed in 1960 by Maron and Kuhns. The
basic idea is that in the process of manually indexing a document, the
indexer associates with each document a set of ordered pairs rather
than a set of terms. The first element of each ordered pair is a term
and the second a subjective estimate of the relevance of the term to
the document, or more precisely, the probability that the document will
be considered relevant to a request containing that term. This
assignment permits the computation of a relevance score of a document
relative to a request.
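
As a rough illustration of the ordered-pair idea, the following Python
sketch stores each document's index as term-weight pairs and ranks
documents against a request. The scoring rule (the product of the
weights of the requested terms, with an unindexed term counting as
zero), the document names, and the weights are all assumptions made for
illustration; the scoring formula itself is not specified here.

```python
# Hypothetical index: each document maps terms to a subjective weight in [0, 1].
INDEX = {
    "doc1": {"cartoons": 0.9, "animation": 0.7},
    "doc2": {"cartoons": 0.3, "film": 0.8},
    "doc3": {"film": 0.6},
}


def relevance_score(doc_terms, request):
    """Assumed rule: multiply the weights of all requested terms."""
    score = 1.0
    for term in request:
        score *= doc_terms.get(term, 0.0)  # an unindexed term gives score 0
    return score


def rank(index, request):
    """Return (document, score) pairs with non-zero score, best first."""
    scored = [(doc, relevance_score(terms, request)) for doc, terms in index.items()]
    return sorted([p for p in scored if p[1] > 0], key=lambda p: -p[1])


ranking = rank(INDEX, ["cartoons"])
```

With this rule a conjunctive request such as "cartoons" AND "film"
retains only documents indexed under both terms; other combination
rules are equally conceivable.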

In the use of statistical association techniques, a coefficient of
association is computed for each pair of terms, indicating the degree
to which the two terms co-occur in the document collection. This
co-occurrence can be either in the list of terms for a particular
document (the document's index set) or by association in the document
text, say, by appearing within a certain distance of each other. This
tendency of certain terms to co-occur provides a relationship which can
be exploited in retrieval: a search can be made not only for documents
indexed by the requested terms, but also for documents indexed by terms
closely associated to the specified ones. As a result, documents may be
retrieved which are useful to the searcher even though indexed by a
combination of terms he would not have thought of suggesting.
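
The computation of association coefficients from index sets can be
sketched as follows. The particular coefficient used (documents indexed
by both terms, divided by documents indexed by either) is an assumption
chosen for simplicity; the authors surveyed here each define their own
measure, and the sample index sets are hypothetical.

```python
from itertools import combinations


def association_coefficients(index_sets):
    """Return {(a, b): coefficient} for every co-occurring pair with a < b.

    Assumed coefficient: n_ab / (n_a + n_b - n_ab), where n_a counts the
    documents indexed by a and n_ab those indexed by both terms.
    """
    term_count = {}
    pair_count = {}
    for terms in index_sets:
        unique = sorted(set(terms))
        for t in unique:
            term_count[t] = term_count.get(t, 0) + 1
        for a, b in combinations(unique, 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    return {
        (a, b): n_ab / (term_count[a] + term_count[b] - n_ab)
        for (a, b), n_ab in pair_count.items()
    }


# Hypothetical index sets for four documents.
docs = [
    ["animation", "cartoons"],
    ["animation", "cartoons", "film"],
    ["film", "theater"],
    ["cartoons"],
]
coeff = association_coefficients(docs)
```

Pairs that never co-occur are simply absent from the result, which
keeps the table small when most term pairs are unrelated.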

Historically, adherents of the two approaches have often formed
opposing camps. Several years ago, there was much emphasis among
computer-oriented people on the coordinate indexing approach as a
solution to deficiencies of classification systems such as their
difficulty in handling the coupling between interdisciplinary fields.
More recently, the assistance which a classification system can provide
a requester has led to research on automatic classification.
Statistical association of terms has served to some extent as a tie
between the two approaches, since in addition to its use in the first
area (in expanding requests to include associated terms), it is used in
many schemes for the automatic generation of classification systems.
Doyle believes that a synthesis of the two approaches will provide
particularly satisfactory results. It is interesting to note that
Cleverdon's research has led him to the opinion that for a well indexed
set of documents, most of the various devices or approaches are
potentially capable of equally good performance, and moreover, the
basic process of indexing is not crucial to good retrieval.

The purpose of this document is to review prior research in these two
areas, relate it to Project VECTOR/ROSE, and to recommend experiments
for determining the extent to which word association techniques can be
introduced into the Moore School Information System.


II. INDEXING AND VOCABULARY GENERATION CONSIDERATIONS

Before going into details on the two approaches to information
retrieval, both of which start with indexed documents, it might be
beneficial to survey the research in automatic indexing. The classic
article on this topic is that of Edmundson and Wyllys, in which they
suggest that the index terms for a document could be the terms whose
relative frequency in the document is significantly higher than their
relative frequency in the literature as a whole. Experimental results
reported by Damerau are encouraging. O'Connor⁶ organized an experiment
in the automatic assigning of two medical terms to documents in which
they do not necessarily appear. The approach is somewhat analogous to
searching for documents to supply information on these terms in an
information retrieval system incorporating statistical associations, a
thesaurus, and various other aids. O'Connor's survey article⁷ discusses
various operations (such as indication of antecedents of pronouns) that
might be done on a document prior to the actual process in which index
terms are assigned, and gives the impression that adequate completely
automatic indexing is some distance in the future.
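
The relative-frequency criterion of Edmundson and Wyllys can be
illustrated with a short sketch. The fixed `ratio` threshold is an
assumed stand-in for a proper test of significance, and the function
name and sample texts are hypothetical.

```python
from collections import Counter


def candidate_index_terms(doc_words, corpus_words, ratio=2.0):
    """Return words whose relative frequency in the document is at least
    `ratio` times their relative frequency in the literature as a whole.
    """
    doc_freq = Counter(doc_words)
    corpus_freq = Counter(corpus_words)
    doc_total = len(doc_words)
    corpus_total = len(corpus_words)
    selected = []
    for word, count in doc_freq.items():
        in_doc = count / doc_total
        in_corpus = corpus_freq.get(word, 0) / corpus_total
        # A word absent from the corpus sample is treated as distinctive.
        if in_corpus == 0 or in_doc / in_corpus >= ratio:
            selected.append(word)
    return sorted(selected)


corpus = "the cat sat on the mat the dog ran to the park the end".split()
document = "the cat saw the cat".split()
terms = candidate_index_terms(document, corpus)
```

In this toy case "the" is frequent in the document but equally frequent
in the corpus, so it is rejected, while "cat" and "saw" are kept.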

A closely allied topic which should be mentioned at this point is
automatic vocabulary generation. Here the object is to find the words
which are important in the document collection as a whole rather than
the important words in a single document. A method for doing this is
treated by Dennis⁸. The hypothesis is that the non-informative words
will have roughly the same relative frequency within all the documents,
but that the informative words will have a skewed distribution of
relative frequency, i.e., in some documents they will appear with a
comparatively high frequency but in most documents they will have a
very low frequency or not appear at all. Note the relation between this
use of relative frequency and the approach of Edmundson and Wyllys. The
hypothesis was tested by Dennis on a set of legal literature and the
feature of skewness proved to be a better test of importance than
several other features.
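
The skewness hypothesis can be made concrete with a small sketch. The
use of the third standardized moment as the skewness measure is an
assumption (the exact statistic used by Dennis is not given here), and
the miniature "legal" documents are invented for illustration.

```python
def skewness(values):
    """Third standardized moment (population form); 0.0 if there is no spread."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5


def word_skewness(documents):
    """Map each word to the skewness of its per-document relative frequency."""
    vocab = {w for doc in documents for w in doc}
    return {
        w: skewness([doc.count(w) / len(doc) for doc in documents])
        for w in vocab
    }


# Hypothetical documents, each a list of words.
documents = [
    ["the", "tort", "the", "tort"],
    ["the", "court", "the", "ruled"],
    ["the", "jury", "the", "met"],
    ["the", "case", "the", "ended"],
]
sk = word_skewness(documents)
```

Here "the" has the same relative frequency in every document and hence
zero skewness, while "tort", concentrated in one document, has a
strongly positive value.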

III. STATISTICAL ASSOCIATION TECHNIQUES

Of the two approaches to information retrieval which this report
analyzes, one (automatic classification) has no single pioneer paper.
For the other (statistical association techniques, including what Doyle
calls associative machine searching), the 1960 paper by Maron and Kuhns
is a landmark. Their concept of probabilistic indexing was discussed
briefly earlier. This same paper also introduces the idea of
statistical association of index terms. It suggests that if two terms
"X" and "Y" appear together more frequently in the set of terms
indexing a document than they would by chance, then a request
containing "X" could be expanded to "X" OR "Y", thus improving recall
(the fraction of relevant documents retrieved). Suppose someone is
interested in documents about "cartoons". If the closely associated
terms "animation", "Walt Disney", etc., are added to the request,
perhaps automatically, relevant documents which did not happen to be
indexed under "cartoons" might be retrieved. The computation of a
relevance score for documents with respect to a request permits
discarding of documents with a low score and hence allows some control
over precision (the fraction of retrieved documents which are
relevant). The documents which have been discarded can be retrieved, if
desired, in order of decreasing relevance score.
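
The effect of request expansion on recall and precision can be seen in
a toy example. The association strengths, the threshold, the index sets
and the relevance judgments below are all hypothetical, and retrieval
is taken to be a simple logical OR over the request terms.

```python
def expand_request(request, associations, threshold=0.5):
    """OR-expand a request with terms whose association is at least `threshold`."""
    expanded = set(request)
    for term in request:
        for other, strength in associations.get(term, {}).items():
            if strength >= threshold:
                expanded.add(other)
    return expanded


def retrieve(index_sets, terms):
    """Return ids of documents indexed by at least one of `terms`."""
    return {doc for doc, idx in index_sets.items() if idx & set(terms)}


def recall_precision(retrieved, relevant):
    """Recall = hits / relevant; precision = hits / retrieved."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


INDEX_SETS = {
    "d1": {"cartoons"},
    "d2": {"animation", "Walt Disney"},
    "d3": {"animation", "painting"},
    "d4": {"theater"},
}
ASSOC = {"cartoons": {"animation": 0.8, "Walt Disney": 0.6, "theater": 0.1}}
RELEVANT = {"d1", "d2"}  # assumed relevance judgments

plain = recall_precision(retrieve(INDEX_SETS, {"cartoons"}), RELEVANT)
expanded_terms = expand_request({"cartoons"}, ASSOC)
wide = recall_precision(retrieve(INDEX_SETS, expanded_terms), RELEVANT)
```

In this example expansion raises recall from one half to one, at the
cost of retrieving one irrelevant document; a relevance-score cutoff is
the means suggested above for recovering precision.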

Another important early paper on statistical association techniques is
that of Stiles. He also makes association of a pair of terms depend on
their co-occurrence in index sets of documents, and calls his measure
of the association of two terms their "association factor". He points
out that a pair of synonyms may have a fairly low association factor,
since they will seldom both be used to index the same documents. But on
the other hand, they are both likely to have high associations with
related terms. For example, "movie" and "motion picture" might have a
low association factor, but they would both be highly associated with
"production", "film", "theater", etc. Stiles would say "movie" and
"motion picture" have a high second generation association. The system
proposed by Stiles expands requests to include first and second
generation associated terms, and computes document relevance scores
based on the association factors between the terms of the document's
index set and the terms of the expanded request.
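
The notion of first and second generation associations can be sketched
as a two-step lookup. The association factors and the threshold below
are invented, and the definition of "second generation" as "reached
through a first generation term but not directly" is a simplifying
assumption rather than Stiles's exact procedure.

```python
def symmetric(pairs):
    """Build a symmetric lookup table from {(a, b): factor}."""
    assoc = {}
    for (a, b), factor in pairs.items():
        assoc.setdefault(a, {})[b] = factor
        assoc.setdefault(b, {})[a] = factor
    return assoc


def first_generation(term, assoc, threshold):
    """Terms whose association factor with `term` is at least `threshold`."""
    return {t for t, f in assoc.get(term, {}).items() if f >= threshold}


def second_generation(term, assoc, threshold):
    """Terms reached through a first generation term but not directly."""
    first = first_generation(term, assoc, threshold)
    second = set()
    for t in first:
        second |= first_generation(t, assoc, threshold)
    return second - first - {term}


# Hypothetical association factors.
FACTORS = symmetric({
    ("movie", "motion picture"): 0.1,
    ("movie", "production"): 0.8,
    ("movie", "film"): 0.7,
    ("motion picture", "production"): 0.9,
    ("motion picture", "film"): 0.6,
})
first = first_generation("movie", FACTORS, 0.5)
second = second_generation("movie", FACTORS, 0.5)
```

The synonym "motion picture" is missed by the direct lookup because of
its low factor with "movie", but is recovered at the second step
through the shared neighbors "production" and "film".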

In later papers Stiles advocates having the requester assign weights to
the terms he specified, then presenting him with a list of associated
terms and permitting him to add to his original list or to revise
weights. Experimental results show that this approach can provide
greater recall than pure coordinate indexing, and (as in Maron and
Kuhns's system) the user can examine the retrieved documents in order
of their estimated relevance to his request.

Doyle proposed the concept of an association map, a two-dimensional
display in which highly associated terms would be close to each other
and would be joined by a line. The patterns of associations presented
by such a map could suggest additional terms that a requester might add
to his request, or could aid in the manual compilation of a thesaurus.
Doyle later came to believe that a hierarchical association map, in
which term hierarchies based on relative frequency of occurrence are
displayed more prominently than pure statistical association, would be
more effective than a map which showed only statistical associations.
Most recently his attention has been centered on a different scheme for
automatic hierarchy generation, discussed in the next section.

A detailed treatment of the mathematics of associative retrieval came
from Giuliano and Jones in 1963. They propose a model in which a linear
transformation of a request vector results in a response vector. The
components of the request vector are the weights assigned to the
different terms by the requester, and the components of the response
vector are relevance scores for the documents in the system. The
request vector is first premultiplied by an index term association
matrix, resulting in what can be considered a modified request vector:
the components corresponding to terms not in the original request but
highly associated to those terms will now have a non-zero value. This
vector will then be multiplied by a term-document connection matrix,
whose typical element c_ij could be the weight assigned to term j as it
indexes document i, obtained by probabilistic indexing. The resulting
vector will be in the form of a response vector. Finally, an optional
multiplication by a document association matrix could adjust the
components (relevance scores) to take into account associations between
documents. Various methods of obtaining the matrices associated with
the transformation are discussed, and an electrical network analog for
accomplishing the transformation is explained. In the latter, the
components of the request vector are supplied to the network as
currents, and the response components are produced as voltages.
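
The chain of multiplications in this model can be traced numerically.
The sketch below uses plain Python lists; the three terms, three
documents, and all matrix entries are hypothetical, and the document
association matrix is taken to be the identity (no adjustment).

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Terms: 0 = "cartoons", 1 = "animation", 2 = "theater" (hypothetical).
A = [  # index term association matrix, with 1.0 on the diagonal
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
C = [  # connection matrix: C[i][j] = weight of term j as it indexes document i
    [0.9, 0.0, 0.0],
    [0.0, 0.8, 0.0],
    [0.0, 0.0, 0.7],
]
D = [  # optional document association matrix (identity here)
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

request = [1.0, 0.0, 0.0]                  # requester asks only for "cartoons"
modified = matvec(A, request)              # associated terms become non-zero
response = matvec(D, matvec(C, modified))  # one relevance score per document
```

The second document, indexed only under "animation", receives a
non-zero score even though that term was not in the original request.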

Use of statistical association techniques is one of the options of the
SMART Retrieval System developed by Salton at Harvard. Association
coefficients can be computed using either co-occurrence of a pair of
terms or concepts in a sentence or co-occurrence in the index set of a
document. Salton calls a group of closely associated terms a cluster.
If one of the terms in a cluster is part of a request, the user may, if
he wishes, add the other terms of the cluster to the request.

Several reports on recent research in the area of statistical
association techniques as well as automatic classification are
contained in a survey by Mary E. Stevens. Discussion of possible
applications of statistical association techniques to Project
VECTOR/ROSE will be held until Section V.

IV. AUTOMATIC CLASSIFICATION

Two types of classification should be distinguished: classification of
concepts or terms and classification of documents. A classification
system for concepts is often a set of successive partitions of the
knowledge to be classified. These successive partitions can be viewed
as a tree structure where the top node (the root) represents the whole
field of knowledge, the nodes one level down represent the large
subdivisions of the field, each set of nodes on the third level
represents a partition of a large subdivision, etc. The relation of a
node to its successors is that of generic concept to more specific
concept. Generally the nodes are labeled and all concepts or terms can
be found as the label of some node in the hierarchy. One of the
criticisms of classification systems is that it is often difficult to
assign just one predecessor node, for example, if an interdisciplinary
concept is involved. Due to problems such as this, several
classification schemes permit a concept to be an immediate member of
more than one class, i.e., they allow a node to have more than one
immediate predecessor.

Classification of documents is a somewhat different idea, though often
a concept classification system is adopted and a document is assigned
to the class corresponding to its main concept. However, documents
could be grouped according to the similarity of their index sets, hence
avoiding explicit use of a concept classification.

One of the purposes of classification is to aid in assigning and
identifying the physical location of documents in storage. But it also
has another use: if a concept hierarchy is available, a user can locate
the terms he was thinking of putting in his request, and upon
inspection of more general or more specific terms or parallel terms on
the tree, he may alter his request to improve his chances of retrieving
what he wants. Similarly, if there is a document classification scheme
and the user knows one document that is relevant to his needs, he may
want to inspect all the other documents in the same class.

Automatic classification schemes fall into two categories: automatic
placement of documents or terms in a predetermined classification
system, or automatic generation of the classification system and
placement of documents or terms in it. Most of the systems take indexed
documents as input and make use of the statistical association of terms
or some related concept. The majority of the systems work with a
two-level hierarchy, corresponding to a single partition.

Another way of looking at classification systems is via vector spaces.
Suppose all knowledge is considered to be an m-dimensional vector
space. Then one might represent concepts as points in this space, and
similar concepts would be closer together than unrelated ones. The
successive partitions could be visualized as sets of hypersurfaces in
the space. (Material relevant to this viewpoint is contained in some of
the literature on pattern recognition.) Some of the people who view
classification in this way are not very explicit about what the axes
would be or how points are located in the space, for example Maron and
Kuhns, and Hayes. Borko's factor analysis approach results in a space
of smaller dimension than the original in which concepts are axes
rather than points, and documents are represented as points. A request
can be viewed in this framework as a point or set of points in the
vector space, and the retrieval problem can be characterized as finding
document points near to the request point or points by some distance
measure. Maron and Kuhns consider the request space and document space
to be distinct and retrieval to be a mapping from the former to the
latter.
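
Retrieval as a search for nearby document points can be sketched in a
few lines. Euclidean distance, the two hypothetical axes, and the
coordinates are all assumptions; as noted above, the choice of axes and
of distance measure is exactly what is often left unspecified.

```python
import math


def nearest_documents(doc_points, request_point, k=2):
    """Return the ids of the k document points closest to the request point."""
    ranked = sorted(doc_points, key=lambda d: math.dist(doc_points[d], request_point))
    return ranked[:k]


# Hypothetical two-dimensional concept space.
DOC_POINTS = {
    "d1": (1.0, 0.0),
    "d2": (0.9, 0.2),
    "d3": (0.0, 1.0),
}
closest = nearest_documents(DOC_POINTS, (1.0, 0.1))
```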

Maron proposed one of the earliest methods for automatic classification
of documents, utilizing the occurrence of key words in abstracts and
Bayesian probability formulas. He used a predetermined classification
system, as did Williams. As an extension of Edmundson and Wyllys's idea
that each field of knowledge is characterized by a special distribution
of word frequencies, Williams experimented with the use of multiple
discriminant functions in the automatic classification of documents.
This procedure seemed to be quite successful in classifying solid state
physics abstracts.

One of the earliest approaches to automatic classification of terms,
and one which involved automatic generation of the categories, was
pursued by Needham, Parker-Rhodes, et al. at the Cambridge Language
Research Unit in England. Their system forms non-exclusive subsets
(called clumps) of a set of terms, such that terms in a given subset
are more closely related to each other than to terms in other subsets.
Their idea is that documents can be indexed by clumps rather than
individual terms. Another approach, due to Borko and Bernick, is to use
factor analysis to group highly correlated terms into factors or
categories. Documents are automatically assigned to one of the
categories derived. Latent class analysis, proposed by Baker, is
another scheme for reducing the large set of terms to a smaller set of
classes. This process utilizes correlations of not only pairs of terms,
but also triplets, or even larger sets, if desired. The computation for
determining to which "latent class" to assign a document is relatively
straightforward. Winters has proposed a modification in Baker's scheme
to make computation more practical. No experimental results have
appeared yet for this method, though they are promised by Winters.

Lefkovitz and Prywes have proposed a term hierarchy generating process,
using what they call inclusive and exclusive partitions. This
hierarchy, which can have more than two levels, aids in the process of
request formulation by steering the requester away from requests
specifying co-occurrence of terms which do not co-occur in the document
collection. Documents are not explicitly classified in this system.

Doyle has experimented with a multi-level hierarchy generation program
written by Ward and Hook. This program operates on indexed documents,
first grouping documents whose sets of terms are most nearly alike,
then forming larger and larger groups in such an order that at each
stage of the process, the most similar remaining groups are combined,
and finally the last two groups are joined. This process is in a sense
the inverse of successive partitions. Each group at any stage can be
thought of as a node in a tree, whose successor nodes are the nodes
representing the groups out of which it was formed. Doyle attempts to
identify the nodes of this tree with concepts, but with limited
success.
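
The bottom-up grouping process described above can be sketched as
follows. This is not the Ward and Hook program: the similarity measure
(shared terms over all terms of the pooled index sets) and the
tie-breaking rule are assumptions, and the documents are hypothetical.

```python
def jaccard(a, b):
    """Similarity of two term sets: shared terms over all terms."""
    return len(a & b) / len(a | b) if a | b else 0.0


def build_hierarchy(index_sets):
    """Repeatedly merge the two most similar groups; return the root node.

    A leaf is a document id; an internal node is a tuple (left, right).
    """
    groups = [(doc, set(terms)) for doc, terms in sorted(index_sets.items())]
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                sim = jaccard(groups[i][1], groups[j][1])
                if best is None or sim > best[0]:  # first best pair wins ties
                    best = (sim, i, j)
        _, i, j = best
        merged = ((groups[i][0], groups[j][0]), groups[i][1] | groups[j][1])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups[0][0]


INDEX_SETS = {
    "d1": {"cartoons", "film"},
    "d2": {"cartoons", "film", "animation"},
    "d3": {"tort", "court"},
    "d4": {"tort", "court", "jury"},
}
tree = build_hierarchy(INDEX_SETS)
```

Each nested tuple is a node whose successors are the groups out of
which it was formed, so the returned structure is the tree described in
the text.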

Other work on automatic classification is reported in Stevens's
Statistical Association Methods.

V. EVALUATION, POSSIBLE APPLICATIONS, AND RECOMMENDATIONS

The experimental work which has been done in the area of automatic
classification has not been very extensive and has been more concerned
with comparison of the results of automatic classification versus
manual classification than with investigation of how automatically
derived classification can facilitate retrieval. Research in generation
of two-level hierarchies or placement of documents in them is further
along the road to practical application than research in multi-level
hierarchies. But a two-level classification system, while an important
step toward a system with more than two levels, will probably not have
a very great impact on information retrieval systems; what it does can
be, and often is, done manually without a great expenditure of effort.
A multi-level classification system, in which nodes would represent
concepts or terms and in which the tree structure would display
relations such as generic-specific, seems to be considerably further
from realization than a two-level scheme.

Needham, however, has an interesting basis for an optimistic viewpoint.
He says that human classifiers of terms know too much about the
meanings of the terms and their relations to each other, and may make a
classification system which is more detailed than necessary for a
particular set of documents. The computer, having at its disposal only
the statistical data for the document set of interest, will not make a
classification schedule with unnecessary detail, because it will have
no other source of information to draw on. The obvious question is:
will this schedule have enough detail to be useful to people?

One problem which lurks in the background of most statistical
association techniques is that the number of possible associations of
n terms with each other, excluding self-associations, is of the order
of n squared. Thus, for large n, the matrix manipulations can become
quite unwieldy, even when using special techniques for dealing with
sparse matrices.

The following is a list of possible applications of statistical
association techniques to Project VECTOR/ROSE:

A. In system establishment

1. Identification of synonyms and near-synonyms, hence
thesaurus generation.

2. Use of diagonal elements of the square of the word
association matrix, which correspond roughly to how
tightly bound a word is to the other words in a
document set, in determining what words should be
included in the controlled vocabulary or choosing a
thesaurus term from a synonym set; this is an alternative
to be compared with Dennis's vocabulary generation
approach, etc.

3. Determination of what terms to pre-coordinate or link
by scanning of text or abstracts.

4. (possibly) Automatic generation of a classification
system for terms or documents.

5. Providing data for man-machine indexing or classification.

6. Providing data for selection of documents to include
in the system.

B. In system operation

1. Automatic expansion of a request to include associated
terms.

2. Assisting the user in formulation of request by
displaying associated terms and the relationships
through which they are associated.
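
Item A.2 of the list above can be illustrated numerically. In the
sketch below the word association matrix is symmetric with a zero
diagonal (self-associations excluded), so each diagonal element of its
square is the sum of the squared associations of one word with all the
others. The three words and the matrix entries are hypothetical, and
how the resulting values would be thresholded is left open.

```python
def squared_diagonal(matrix):
    """Diagonal of matrix times matrix: entry i is the sum over k of m[i][k] * m[k][i]."""
    n = len(matrix)
    return [sum(matrix[i][k] * matrix[k][i] for k in range(n)) for i in range(n)]


WORDS = ["film", "movie", "zebra"]
M = [  # hypothetical word association matrix, zero diagonal
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.0],
    [0.1, 0.0, 0.0],
]
binding = dict(zip(WORDS, squared_diagonal(M)))
ranked = sorted(WORDS, key=lambda w: -binding[w])
```

A word such as "zebra", weakly associated with everything else,
receives a small value and would be a poor candidate for the controlled
vocabulary under this criterion.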